During 20-21 January 2014, on the island country of Kronos, several employees of GAStech, a Tethys-based gas multinational, go missing. Who is missing? Where have they gone? Were they kidnapped? If so, who is responsible?
To get to the bottom of this mystery, I will use visual analytic techniques to analyze the data GAStech provided to assist law enforcement's investigation, and hopefully find the missing persons and bring them home safely. The data provided by GAStech cover the two weeks prior to the employees' disappearance:
Movement and tracking data from GAStech company cars equipped with GPS tracking
GAStech employees' credit card transactions and Kronos Kares loyalty card data
The objective of this project is to use interactive visual analytic techniques to surface and identify anomalies and suspicious behavior. More precisely, I aim to shed light on the following questions:
Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies?
Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find?
Can you infer the owners of each credit card and loyalty card? What is your evidence? Where are there uncertainties in your method? Where are there uncertainties in the data?
Given the data sources provided, identify potential informal or unofficial relationships among GASTech personnel. Provide evidence for these relationships.
Do you see evidence of suspicious activity? Identify 1-10 locations where you believe the suspicious activity is occurring, and why.
I reviewed the submissions for VAST Challenge 2014 for a better appreciation of the approaches and techniques adopted to solve Mini-Challenge 2. I summarise below useful methodologies I wish to consider for my assignment.
The DAC-MC2 team from Virginia Tech used a methodology called Points of Interest ("POI") to identify locations such as people's homes, workplaces, and recreational spots (e.g. restaurants, cafes). A location is considered a POI if more than 5 minutes is spent there and the location has a diameter of less than 50 meters. They then graphed the distribution of POIs over time for various days (weekdays and weekends) and location types (i.e. home, work, recreation).
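The two thresholds can be expressed as a simple predicate. The function below is my own illustrative sketch of the rule, not the DAC-MC2 team's actual code:

```r
# Sketch of the DAC-MC2 POI rule (my own reconstruction):
# a stop is a POI if it lasts more than 5 minutes and the
# visited area is less than 50 meters across.
is_poi <- function(duration_minutes, diameter_meters) {
  duration_minutes > 5 & diameter_meters < 50
}

is_poi(12, 30)  # TRUE: long stop in a small area
is_poi(3, 30)   # FALSE: too brief to count as a POI
```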
The packages I use in my analysis are listed below.
# Load packages
packages <- c("raster", "sf", "clock", "tmap",
              "tidyverse", "rgdal", "htmlwidgets", "leaflet",
              "geosphere", "sqldf", "ggwordcloud", "ggforce",
              "sessioninfo", "plotly", "lubridate", "viridis",
              "ggraph", "igraph", "ggiraph", "DT")

for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
I load the data used for the analysis.
# Import data
df_cars <- read_csv("data/car-assignments.csv")
df_gps <- read_csv("data/gps.csv")
df_cc <- read_csv("data/cc_data.csv")
df_loyalty <- read_csv("data/loyalty_data.csv")
df_meetups <- readRDS("data/df_meetups.rds")
I noticed that the location name Katerina’s Café was causing issues in R, likely due to the curly apostrophe and the é character. I clean this name to prevent downstream encoding problems.
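The cleaning step is a simple string substitution; a sketch of what I apply is below (the plain-ASCII replacement label is my own choice):

```r
library(stringr)

# Replace the name containing a curly apostrophe and an
# accented é with a plain-ASCII label. The "." wildcards
# match the problematic characters without embedding them
# in the source file.
fix_cafe <- function(x) {
  str_replace_all(x, "Katerina.s Caf.", "Katerinas Cafe")
}

# Applied to both card data sets, e.g.:
# df_cc$location      <- fix_cafe(df_cc$location)
# df_loyalty$location <- fix_cafe(df_loyalty$location)
```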
I concatenate LastName and FirstName for easier reporting.
# Create new column full_name
df_cars <- df_cars %>%
  unite(full_name, c("LastName", "FirstName"), sep = " ")
I use date_time_parse from the clock R package to parse the timestamps into POSIXct values.
# Format timestamps
df_cc$day_timestamp <- date_time_parse(df_cc$timestamp,
                                       zone = "",
                                       format = "%m/%d/%Y")
df_cc$timestamp <- date_time_parse(df_cc$timestamp,
                                   zone = "",
                                   format = "%m/%d/%Y %H:%M")

df_loyalty <- df_loyalty %>%
  mutate(timestamp = date_time_parse(timestamp,
                                     zone = "",
                                     format = "%m/%d/%Y")) %>%
  rename(day_timestamp = timestamp)

df_gps$day_timestamp <- date_time_parse(df_gps$Timestamp,
                                        zone = "",
                                        format = "%m/%d/%Y")
df_gps$Timestamp <- date_time_parse(df_gps$Timestamp,
                                    zone = "",
                                    format = "%m/%d/%Y %H:%M")
I use get_hour from the clock R package and wday from the lubridate R package to extract the hour of day and day of week. I later use these fields to visualize various time dimensions.
# Get various dates and times
df_cc$hour_of_day <- get_hour(df_cc$timestamp)
df_cc$day_of_week <- wday(df_cc$timestamp, label = TRUE, abbr = FALSE)

df_loyalty$day_of_week <- wday(df_loyalty$day_timestamp,
                               label = TRUE,
                               abbr = FALSE)

df_gps$hour_of_day <- get_hour(df_gps$Timestamp)
df_gps$day_of_week <- wday(df_gps$Timestamp, label = TRUE, abbr = FALSE)
I use the sf R package to turn the longitude and latitude coordinate points into geometries.
# Format coordinates
sf_gps <- st_as_sf(df_gps,
                   coords = c("long", "lat"),
                   crs = 4326)
I join credit card data with loyalty card data, and I join car assignments data with gps data.
I use an inner join between the credit card and loyalty card data so that transactions with discrepancies (e.g. a price mismatch) are dropped, since they only introduce uncertainty. The inner join removes 403 rows from the credit card data and 305 rows from the loyalty card data.
On the other hand, I use a left join to join gps data and car assignment data. I do this because trucks do not have CarIDs, and I do not want to filter them out.
# Join credit card and loyalty data
df_cc_loyalty <- df_cc %>%
  inner_join(df_loyalty, by = c("day_timestamp" = "day_timestamp",
                                "location" = "location",
                                "price" = "price"))

# Join gps and cars data
df_gps_cars <- df_gps %>%
  left_join(df_cars, by = c("id" = "CarID"))

sf_gps_cars <- sf_gps %>%
  left_join(df_cars, by = c("id" = "CarID"))
To determine the most popular locations, I use interactive heatmap visualizations of credit card and loyalty card data. I visualize this data by location and across three time dimensions where possible: hour of day, date, and day of week.
Note: For brevity, I only show and discuss the code for one heatmap; the approach is the same across all heatmaps.
I create these global variables to sort the locations on the y-axis of the heatmaps.
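The definitions of these variables are not shown here; a plausible reconstruction of cc_locations, which orders the y-axis by popularity, is:

```r
library(dplyr)

# Order locations by total number of credit card
# transactions so the heatmap y-axis reads from most to
# least popular. This sketch assumes df_cc has been loaded
# and cleaned as above.
cc_locations <- df_cc %>%
  count(location, sort = TRUE) %>%
  pull(location)
```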
To generate interactive heatmaps, I use the ggplot2 and plotly R packages. First, I prepare the data by counting the number of transactions (rows); this count later drives the fill shading in the heatmap. Next, I pass the data to ggplot, where I set up the graph and customize the aesthetics. Lastly, I pass the ggplot object to plotly to make the graph interactive. The code below is an example snippet.
# Hour of day
df_hour_of_day <- df_cc %>%
  count(location, hour_of_day) %>%
  mutate(location = as.factor(location),
         hour_of_day = as.factor(hour_of_day),
         text = paste0("Location: ", location, "\n",
                       "Hour: ", hour_of_day, "\n",
                       "n: ", n))

hm_hour_of_day <- ggplot(df_hour_of_day, aes(hour_of_day,
                                             location,
                                             fill = n,
                                             text = text)) +
  geom_tile() +
  scale_fill_viridis(discrete = FALSE) +
  scale_y_discrete(limits = cc_locations) +
  scale_x_discrete() +
  ggtitle("# of Credit Card Transactions by Hour of Day") +
  xlab("Hour of Day") +
  theme(panel.grid.major = element_blank(),
        axis.ticks.x = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.x = element_text(size = 6),
        axis.text.y = element_text(size = 6),
        axis.title.y = element_blank())

ggplotly(hm_hour_of_day, tooltip = "text")
Note: Timestamps in the loyalty data set did not have hours and minutes. Therefore, I was unable to produce a heatmap for loyalty data by hour of day.
First, I’m going to tackle the discrepancies between credit and loyalty card data.
I prepare the data by counting credit card and loyalty card records by location and date. I then join the two counts on location and day_timestamp. Once joined, I calculate the difference between the number of credit card records and loyalty card records; this difference will represent the fill shading in the heatmap. Where the difference is 0, I convert the value to NA so it is filtered out of the heatmap.
# Prep data for heatmap diff
df_cc_helper <- df_cc %>%
  count(location, day_timestamp)

df_loyalty_helper <- df_loyalty %>%
  count(location, day_timestamp)

df_diff <- df_cc_helper %>%
  full_join(df_loyalty_helper, by = c("location" = "location",
                                      "day_timestamp" = "day_timestamp")) %>%
  rename(n_cc = n.x,
         n_loyalty = n.y) %>%
  replace_na(list(n_cc = 0, n_loyalty = 0)) %>%
  mutate(diff = n_cc - n_loyalty,
         diff = na_if(diff, 0))

df_diff <- df_diff %>%
  filter(!is.na(diff)) %>%
  mutate(location = as.factor(location),
         day_timestamp = as.factor(day_timestamp),
         text = paste0("Location: ", location, "\n",
                       "Date: ", day_timestamp, "\n",
                       "Diff: ", diff))
Once the data is prepared, I use the same code as above to generate the following heatmap.
To compare the results from credit card and loyalty card data with vehicle data, I must georeference the map of Abila and determine what GPS coordinates to plot.
I use QGIS to georeference the map of Abila. I then load the georeferenced map using the raster R package.
bgmap <- raster("data/MC2-tourist_modified.tif")
To determine what GPS coordinates to plot, I take a page from the DAC-MC2 team from Virginia Tech and calculate "Points of Interest" or POIs. The tracking devices record a car's location periodically as long as the car is moving, so a gap in the data means the car must have stopped or parked. I take advantage of this feature to calculate how long a car was stationary, and I treat any stop of more than 5 minutes as a POI.
# Points of Interest (POIs)
sf_poi <- sf_gps_cars %>%
  group_by(id) %>%
  mutate(DepartureTimestamp = lead(Timestamp, order_by = id)) %>%
  # force the difference into seconds so the 5-minute
  # threshold is not affected by difftime's automatic units
  mutate(Timestamp_diff_seconds = as.numeric(DepartureTimestamp - Timestamp,
                                             units = "secs")) %>%
  mutate(is_poi = Timestamp_diff_seconds >= 60 * 5) %>%
  filter(is_poi)

sf_poi <- rename(sf_poi, ArrivalTimestamp = Timestamp)
I graph the POIs onto the georeferenced map.
gps_dots_selected <- sf_poi %>%
  mutate(MinutesDuration = round(Timestamp_diff_seconds / 60, 2)) %>%
  select(id,
         day_timestamp,
         hour_of_day,
         day_of_week,
         ArrivalTimestamp,
         DepartureTimestamp,
         MinutesDuration,
         full_name,
         CurrentEmploymentType,
         CurrentEmploymentTitle)

tmap_mode("view")
m <- tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1, g = 2, b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
  tm_shape(gps_dots_selected) +
  tm_dots(col = 'red', border.col = 'gray', size = .1, alpha = 0.3, jitter = 1) +
  tm_facets(by = "day_timestamp", ncol = 1)
m
gps_dots_selected <- sf_poi %>%
  mutate(MinutesDuration = round(Timestamp_diff_seconds / 60, 2)) %>%
  select(id,
         day_timestamp,
         hour_of_day,
         day_of_week,
         ArrivalTimestamp,
         DepartureTimestamp,
         MinutesDuration,
         full_name,
         CurrentEmploymentType,
         CurrentEmploymentTitle)

tmap_mode("view")
m <- tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1, g = 2, b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
  tm_shape(gps_dots_selected) +
  tm_dots(col = 'red', border.col = 'gray', size = .1, alpha = 0.3, jitter = 1) +
  tm_facets(by = "day_of_week", ncol = 1)
m
To infer the owners of each credit card and loyalty card using visual analytics, I propose pairing an interactive map of POIs with an interactive data table of credit card and loyalty card transactions. Using these two visual tools, I triangulate purchases against POIs: the person who made a purchase at a location should have their vehicle's GPS trace at that same location and time.
Some important assumptions underpin this approach. First, the credit card data and GPS tracker data accurately represent the truth: the timestamp logged for a transaction is the exact time the purchase was made, and the GPS tracker was functioning correctly.
Second, each vehicle was driven only by the person assigned to it, and that person is the one who made the purchase.
I clean the credit card and loyalty card data and prepare to visualize it in a table format.
df_cc_cleaned <- df_cc_loyalty %>%
  mutate(last4ccnum = as.character(last4ccnum),
         timestamp = format(timestamp, "%m/%d/%Y %H:%M")) %>%
  select(timestamp,
         location,
         last4ccnum,
         loyaltynum)

DT::datatable(df_cc_cleaned,
              class = 'cell-border',
              filter = 'top',
              caption = htmltools::tags$caption(
                style = 'caption-side: bottom; text-align: center;',
                'Table: ', htmltools::em('Credit card and Loyalty card data.'))) %>%
  formatDate(1, method = 'toLocaleString', params = list(month = 'numeric',
                                                         day = 'numeric',
                                                         year = 'numeric',
                                                         hour = 'numeric',
                                                         minute = 'numeric')) %>%
  formatStyle(0,
              target = 'row',
              lineHeight = '25%')
I recalculate the POIs because, for reasons discussed later, the data cannot carry geometries at this point.
df_poi <- df_gps_cars %>%
  group_by(id) %>%
  mutate(DepartureTimestamp = lead(Timestamp, order_by = id)) %>%
  mutate(Timestamp_diff_seconds = as.numeric(DepartureTimestamp - Timestamp,
                                             units = "secs")) %>%
  mutate(is_poi = Timestamp_diff_seconds >= 60 * 5) %>%
  filter(is_poi)

df_poi <- rename(df_poi, ArrivalTimestamp = Timestamp)
I use the sqldf R package to join df_poi onto itself. This package is helpful here because the join condition compares timestamps with inequalities rather than equality. By joining on ArrivalTimestamp and DepartureTimestamp (as seen below), I get the records where POIs overlap in time. A similar effect may be possible with the dplyr package, but I found it much easier to accomplish with sqldf.
Note: sqldf does not work with geometries.
df_meetups <- sqldf(
  "select
     *,
     a.id || b.id ||
     a.ArrivalTimestamp || b.ArrivalTimestamp ||
     a.DepartureTimestamp || b.DepartureTimestamp ||
     a.lat || b.lat || a.long || b.long as key
   from df_poi as a
   inner join df_poi as b
     on a.id != b.id
     and a.ArrivalTimestamp < b.DepartureTimestamp
     and a.DepartureTimestamp > b.ArrivalTimestamp")
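The dplyr route mentioned above is indeed feasible; a sketch of my own (not part of the original analysis, and assuming dplyr >= 1.1 for cross_join) pairs every POI with every other and filters on interval overlap:

```r
library(dplyr)

# Pair every POI with every other POI, then keep the pairs
# whose [Arrival, Departure] intervals overlap in time.
# This mirrors the sqldf join conditions; cross_join
# suffixes duplicated columns with .x and .y.
overlapping_pois <- function(df_poi) {
  cross_join(df_poi, df_poi) %>%
    filter(id.x != id.y,
           ArrivalTimestamp.x < DepartureTimestamp.y,
           DepartureTimestamp.x > ArrivalTimestamp.y)
}
```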
I split df_meetups into “a” and “b” data sets because st_as_sf will only create one geometry per data frame. Thus, I must convert the coordinates for “a” and “b” separately and then bring them back together. I concatenate several columns into a key which I later use to rejoin the data.
df_meetups_a <- sqldf(
  "select
     a.*,
     a.id || b.id ||
     a.ArrivalTimestamp || b.ArrivalTimestamp ||
     a.DepartureTimestamp || b.DepartureTimestamp ||
     a.lat || b.lat ||
     a.long || b.long as key
   from df_poi as a
   inner join df_poi as b
     on a.id != b.id
     and a.ArrivalTimestamp < b.DepartureTimestamp
     and a.DepartureTimestamp > b.ArrivalTimestamp")

df_meetups_b <- sqldf(
  "select
     b.*,
     a.id || b.id ||
     a.ArrivalTimestamp || b.ArrivalTimestamp ||
     a.DepartureTimestamp || b.DepartureTimestamp ||
     a.lat || b.lat ||
     a.long || b.long as key
   from df_poi as a
   inner join df_poi as b
     on a.id != b.id
     and a.ArrivalTimestamp < b.DepartureTimestamp
     and a.DepartureTimestamp > b.ArrivalTimestamp")
I use the sf R package to turn the longitude and latitude coordinate points into geometries.
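The conversion itself is not shown at this point in the post; a reconstruction consistent with the earlier st_as_sf call (same coordinate columns and CRS) would be:

```r
library(sf)

# Convert the "a" and "b" coordinates to geometries
# separately, since st_as_sf creates only one geometry
# column per data frame. CRS 4326 matches the earlier
# conversion of the gps data.
df_meetups_a <- st_as_sf(df_meetups_a,
                         coords = c("long", "lat"),
                         crs = 4326)
df_meetups_b <- st_as_sf(df_meetups_b,
                         coords = c("long", "lat"),
                         crs = 4326)
```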
I bring together the “a” and “b” data sets, each with its own geometry. dplyr’s inner_join doesn’t allow both the first and second data sets to be sf objects. To get around this, I create a dummy table from CurrentEmploymentType so that “a” and “b” sit in the 2nd and 3rd positions respectively, which lets me join the two data sets using dplyr’s inner_join.
# Dummy table
CurrentEmploymentType <- unique(df_cars$CurrentEmploymentType)
CurrentEmploymentType <- data.frame(CurrentEmploymentType)

df_meetups <- CurrentEmploymentType %>%
  inner_join(df_meetups_a, by = c("CurrentEmploymentType")) %>%
  inner_join(df_meetups_b, by = c("key"))
I calculate the distance in meters between the two geometries. I save the result to a .rds file because this code takes a long time to execute; in future runs, I simply read the .rds file back in.
# Calculate distance between the two geometries
df_meetups <- df_meetups %>%
  rowwise() %>%
  mutate(distance = st_distance(geometry.x, geometry.y)) %>%
  mutate(distance = as.numeric(distance))

# Save a single object to a file
saveRDS(df_meetups, "df_meetups.rds")
I create two helper functions, greater and lesser, to clean up my code later when I compare two timestamps and need the greater or lesser of the pair.
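Their definitions are not shown in the post; a minimal reconstruction consistent with how they are used below is:

```r
# Element-wise max/min of two timestamp vectors. pmax/pmin
# compare pairs position by position and preserve the
# POSIXct class, which a plain max()/min() would not.
greater <- function(x, y) pmax(x, y)
lesser  <- function(x, y) pmin(x, y)
```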
I filter the rows in df_meetups so the distance between the two geometries is less than 100 meters. This ensures that the two cars were in the same general area.
What we’re now left with are POIs that overlap in time and happened in the same area (within 100 meters). I’ll call these “meet ups” moving forward.
I calculate the overlap in time between one person’s arrival and departure and another person’s arrival and departure.
df_meetups_filtered <- df_meetups %>%
  filter(distance <= 100) %>%
  select(full_name.x,
         ArrivalTimestamp.x,
         DepartureTimestamp.x,
         CurrentEmploymentType.x,
         CurrentEmploymentTitle.x,
         full_name.y,
         ArrivalTimestamp.y,
         DepartureTimestamp.y,
         CurrentEmploymentType.y,
         CurrentEmploymentTitle.y,
         distance,
         day_timestamp.x,
         day_of_week.x,
         hour_of_day.x) %>%
  rename(day_timestamp = day_timestamp.x,
         day_of_week = day_of_week.x,
         hour_of_day = hour_of_day.x) %>%
  # overlap in minutes; force the difference into seconds first
  mutate(overlap = as.numeric(lesser(DepartureTimestamp.x, DepartureTimestamp.y) -
                              greater(ArrivalTimestamp.x, ArrivalTimestamp.y),
                              units = "secs") / 60)
I pass df_meetups_filtered to ggraph to graph a network of meet ups between people who were at the same place at the same time.
network <- df_meetups_filtered %>%
  filter(hour_of_day <= 6) %>%
  select(full_name.x,
         full_name.y,
         overlap) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(alpha = overlap)) +
  geom_node_point(size = 6,
                  color = "lightblue") +
  geom_node_text(aes(label = name),
                 color = "red",
                 repel = TRUE) +
  theme_void()

network

network <- df_meetups_filtered %>%
  filter(hour_of_day >= 21) %>%
  select(full_name.x,
         full_name.y,
         overlap) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(alpha = overlap)) +
  geom_node_point(size = 6,
                  color = "lightblue") +
  geom_node_text(aes(label = name),
                 color = "red",
                 repel = TRUE) +
  theme_void()

network

network <- df_meetups_filtered %>%
  filter(day_of_week %in% c("Saturday", "Sunday")) %>%
  select(full_name.x,
         full_name.y,
         overlap) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(alpha = overlap)) +
  geom_node_point(size = 6,
                  color = "lightblue") +
  geom_node_text(aes(label = name),
                 color = "red",
                 repel = TRUE) +
  theme_void()

network

network <- df_meetups_filtered %>%
  filter(overlap >= 60 * 6, overlap <= 60 * 8) %>%
  select(full_name.x,
         full_name.y,
         overlap) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(alpha = overlap)) +
  geom_node_point(size = 6,
                  color = "lightblue") +
  geom_node_text(aes(label = name),
                 color = "red",
                 repel = TRUE) +
  theme_void()

network

network <- df_meetups_filtered %>%
  filter(overlap >= 60 * 8) %>%
  select(full_name.x,
         full_name.y,
         overlap) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(alpha = overlap)) +
  geom_node_point(size = 6,
                  color = "lightblue") +
  geom_node_text(aes(label = name),
                 color = "red",
                 repel = TRUE) +
  theme_void()

network

# Plot a single car's movement as a path
gps_path <- sf_gps %>%
  group_by(id) %>%
  summarize(m = mean(Timestamp),
            do_union = FALSE) %>%
  st_cast("LINESTRING")

gps_path_selected <- gps_path %>%
  filter(id == 1)

tmap_mode("view")
tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1, g = 2, b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
  tm_shape(gps_path_selected) +
  tm_lines()
# Facets map
gps_dots_selected <- sf_poi %>%
  mutate(MinutesDuration = round(Timestamp_diff_seconds / 60, 2)) %>%
  select(id,
         ArrivalTimestamp,
         DepartureTimestamp,
         MinutesDuration,
         full_name,
         CurrentEmploymentType,
         CurrentEmploymentTitle)

tmap_mode("view")
m <- tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1, g = 2, b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
  tm_shape(gps_dots_selected) +
  tm_dots(col = 'red', border.col = 'gray', size = .1, alpha = 0.3, jitter = 1) +
  tm_facets(by = "id", ncol = 2)
m
# POIs of security personnel only
gps_dots_selected <- sf_poi %>%
  filter(CurrentEmploymentType == "Security") %>%
  mutate(MinutesDuration = round(Timestamp_diff_seconds / 60, 2)) %>%
  select(id,
         ArrivalTimestamp,
         DepartureTimestamp,
         MinutesDuration,
         full_name,
         CurrentEmploymentType,
         CurrentEmploymentTitle)

tmap_mode("view")
m <- tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1, g = 2, b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255) +
  tm_shape(gps_dots_selected) +
  tm_dots(col = 'red', border.col = 'gray', size = 0.007, alpha = 0.3)
m